Impact of mobile connectivity on students’ wellbeing: Detecting learners’ depression using machine learning algorithms

Depression is a psychological state of mind that often influences a person in an unfavorable manner. While it can occur in people of all ages, students are especially vulnerable to it throughout their academic careers. Beginning in 2020, the COVID-19 epidemic caused major problems in people’s lives by driving them into quarantine and forcing them to be connected continually with mobile devices, such that mobile connectivity became the new norm during the pandemic and beyond. This situation is further accelerated for students as universities move towards a blended learning mode. In these circumstances, monitoring student mental health in terms of mobile and Internet connectivity is crucial for their wellbeing. This study focuses on students attending an International University of Bangladesh to investigate their mental health due to their continual use of mobile devices (e.g., smartphones, tablets, laptops etc.). A cross-sectional survey method was employed to collect data from 444 participants. Following the exploratory data analysis, eight machine learning (ML) algorithms were used to develop an automated normal-to-extreme severe depression identification and classification system. When the automated detection was incorporated with feature selection such as Chi-square test and Recursive Feature Elimination (RFE), about 3 to 5% increase in accuracy was observed by the method. Similarly, a 5 to 15% increase in accuracy has been observed when a feature extraction method such as Principal Component Analysis (PCA) was performed. Also, the SparsePCA feature extraction technique in combination with the CatBoost classifier showed the best results in terms of accuracy, F1-score, and ROC-AUC. The data analysis revealed no sign of depression in about 44% of the total participants. About 25% of students showed mild-to-moderate and 31% of students showed severe-to-extreme signs of depression. 
The results suggest that ML models incorporating a proper feature engineering method can serve adequately in multi-stage depression detection among students. This model might be utilized in other disciplines for detecting early signs of depression among people.


Introduction
Depression is a state of mind that often has undesirable effects on an individual's feelings, motivations, and actions. It gradually inhibits a person's day-to-day activities and attention and leads to a general loss of pleasure and interest in work. This state of mind often serves as a contributing factor for many psychological and physical complications such as diabetes and heart disease [1,2]. Approximately 3.8% of the world's population experiences this sort of mental disorder [3]. In Bangladesh, 6.4 million people (4.10%) experience different depressive disorders, and more than 75% of people in such Low- and Middle-Income Countries (LMICs) fail to receive proper care for their mental health [4,5]. In addition, mistreated depression can instigate self-destructive tendencies, and it is among the psychiatric illnesses most commonly linked to abnormal death [6][7][8]. A biochemical or neurophysiological response (e.g., turnover of monoamines such as noradrenaline and serotonin, also known as 5-HT) is also assumed to relate to depression in individuals [9].
Since the beginning of 2020, the COVID-19 pandemic has caused a paradigm shift in people's lives and introduced unprecedented changes to their usual lifestyles. For example, all in-person activities were suspended across educational institutions and replaced by an online education system that was introduced within a brief time. Hands-on experience was replaced with the continuous use of digital tools and learning over the internet via mobile screens. Research reported that students initially struggled to cope with the situation [10], but by the end of that year they had become habituated to the remote learning system [11].
Although the hybrid mode of learning was found to be promising in the event of a disruption in academic processes [12], its applicability needs to be tested for different populations. LMICs such as Bangladesh lack robust infrastructure, including a stable electricity grid, adequate internet connections, sufficient on-campus resources, technical knowledge, and teachers with sufficient competency to support the students. Multiple reports showed that students were negatively impacted by these issues.
Under this new learning context, monitoring students' mental health is essential for providing a favorable learning environment that promotes well-being. Although depression can be identified among all age groups, students are particularly susceptible to it, as the new education delivery and management system affects them the most. A student can experience depressive episodes of a single, recurrent, or bipolar nature [13], which can greatly affect their psychological well-being as well as their social and work life. Students pursuing undergraduate degrees need to coordinate many tasks and responsibilities such as regular coursework, lab assignments, and optional projects, which can become cumbersome and depressing at times [14]. Thus, university authorities should be aware of student mental health in this new learning environment.
The Depression, Anxiety, and Stress Scale (DASS-21) is a popular tool to measure students' depression. Numerous studies have assessed the validity and reliability of this tool [41][42][43][44]. Javaeed et al. found a mild positive correlation between internet addiction and depression when employing the DASS-21 scale among undergraduate medical students in Kashmir [45]. Priya et al. predicted depression by applying five different machine learning algorithms to data collected using the DASS-21 scale, with the Naïve Bayes algorithm achieving the best accuracy [46]. Kumar et al. similarly predicted anxiety, depression, and stress at five severity levels from the DASS-42 scale using eight machine learning algorithms [47]. Oei et al. proposed a revised DASS-18 scale for the Asian population to overcome any cultural influence [48].
Mutalib et al. performed a similar analysis on higher education students in Malaysia, utilizing the World Health Organization Quality of Life (WHOQOL) tool to identify factors responsible for mental health problems [49]. Elhai et al. studied problematic smartphone use (PSU) among Chinese undergraduate students to identify the severity of mental health problems and its correlation with psychopathology variables [50]. Muzammel et al. investigated the effect of vowel and consonant phonemes on recognizing clinical depression from speech [51]. Yasin et al. reviewed EEG-based detection of major and bipolar depressive disorder, thoroughly discussing state-of-the-art neural networks [52]. Othmani et al. performed several studies on depression detection from speech and audio-visual data [53][54][55]. They explored a Model of Normality that produces results more efficiently by comparing the audio-visual patterns of a depressed subject with those of a symptom-free one. They also proposed EmoAudioNet for both emotion and major depressive disorder detection, and a correlation-based network for detecting depression relapse. Richter et al. [56,57] designed two studies using a machine learning approach to identify the differences in behavioral patterns stemming from anxiety and depressive disorders.
In the context of Bangladesh, Islam et al. [58] performed a statistical analysis of adolescent mental health, employing the Patient Health Questionnaire (PHQ-9) and Generalized Anxiety Disorder Assessment (GAD-7) scales to ascertain the mental health conditions prevalent among students. Ahmmed et al. [59] attempted to identify the effect of the global pandemic on the psychological well-being and academic studies of undergraduate students. Using ML algorithms, Ahnaf et al. [60] studied a sample of 577 undergraduate students to determine the reasons for the frequent occurrence of depression in this cohort. Munir et al. [61] tried to address the inadequate information regarding depression predictors using ML classifiers. Nayan et al. [62] aimed to differentiate between anxiety and depression among university students during the first wave of coronavirus and achieved high detection accuracy (91.49%) with a Support Vector Machine classifier [63]. One of the notable contributions came from Zulfiker et al. [63], who employed six machine learning classifiers using various socio-demographic variables and psychosocial information to predict depression among students. They used different feature selection methods to extract relevant features for the ML models, together with the Synthetic Minority Oversampling Technique (SMOTE), which oversamples minority classes.
The current study aimed to investigate students' digital well-being while the third variant of coronavirus, Omicron, was rampant. It is anecdotally hypothesized that the uncertain lockdowns imposed during the previous variants impacted the mental health of many students. Thus, many of them may have experienced depression and other symptoms, which may partly be attributed to the hybrid mode of learning during the pandemic and beyond. Specifically, the new normal increases the requirements for mobile and Internet connectivity in students' daily academic and social lives. Despite this interaction between mental well-being and the new digital reality, it appears that no studies have been dedicated to investigating the impact of continuous mobile connectivity on students' mental health in the context of Bangladesh [65]. Thus, a larger study was conducted employing a survey scale to investigate depression, anxiety, stress, and mental wellbeing of students in relation to their mobile and internet connectivity. The current study reports only the depression component of the larger study.

Data acquisition
2.1.1 Questionnaire design. A digital well-being questionnaire consisting of 72 items in total was developed for the larger study [64]. The survey was circulated among the students using a Google form. Participants' voluntary consent was collected before the survey was conducted. Five items collected demographic information: gender, age, residence, ethnicity/geographical location, and academic level of study. Two items obtained data on sleeping hours and the amount of time students were connected to the Internet. Two items inquired into student satisfaction with sleep and academic study. The depression component of the questionnaire consists of 7 items. Each of these 7 items was answered on a 4-point rating scale along a continuum from Not at all true (0) to Always true (3).
This study focused particularly on depression, for which the DASS-Depression subscale was adopted. DASS-Depression is taken from the DASS-21, which was established by Lovibond et al. in 1995 to investigate the depression of a cohort [65]. The 7 items utilized for this study are: I could not seem to experience any positive feeling at all; I found it difficult to work up the initiative to do things; I felt that I had nothing to look forward to; I felt downhearted and sad; I was unable to become enthusiastic about anything; I felt I was not worth much as a person; and I felt that life was meaningless.

Ethics statement.
This study was approved by the Committee for Advanced Studies and Research (CASR), Islamic University of Technology (IUT), Bangladesh. Consent was collected from the participants at the beginning of the survey.

DASS-Depression scale.
The Depression-Anxiety-Stress Scale (DASS) is a widely used survey instrument for evaluating depression. There are numerous other scales for depression screening, such as the Beck Depression Inventory (BDI), Patient Health Questionnaire (PHQ), Hamilton Rating Scale, and Geriatric Depression Scale (GDS). However, DASS can be regarded as a benchmark scale to evaluate depressive mental conditions [66]. The full scale consists of 42 items divided into three subscales for measuring depression, anxiety, and stress. The reduced 21-item DASS-21 scale exhibits psychometric properties similar to those of the larger scale. As discussed earlier, this scale has been validated among different populations of the world and has good internal consistency among the depression-related items (Cronbach's alpha, α = 0.94) [67]. Unlike some other scales, it can differentiate mood disturbances from core depression symptoms such as dysphoria, hopelessness, self-deprecation, and inertia, a characteristic that can help medical practitioners gain a better understanding of a student's mental health [67,68]. The 7-item DASS-Depression scale is one of the three self-report subscales of the DASS-21. Each item is scored on a 4-point severity scale, and the sum of the 7 items' scores is multiplied by 2 to obtain the depression severity score of each participant [69]. The five depression severity levels are determined as follows:
• Normal (scores 0-9)-labelled as 0
• Mild (scores 10-13)-labelled as 1
• Moderate (scores 14-20)-labelled as 2
• Severe (scores 21-27)-labelled as 3
• Extremely Severe (scores 28+)-labelled as 4

Experimental design
2.2.1 Survey objective. The primary objective of this survey was to assess the mental condition of the students due to their increased exposure to digital tools, mobile connectivity, and internet activities associated with the hybrid or distance mode of learning. In addition, sleep data and internet connectivity status were collected to see whether there was an irregular connection between the hybrid mode of learning and physical activity.
As a general guideline, participants were instructed not to think about a question for too long and to answer all items related to mental condition based on their experience of the past 4 weeks. At the end of the survey, a total of 444 responses had been submitted, and the data were saved into a .csv file for further processing.
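The DASS-Depression scoring rule (sum of the seven item ratings, doubled, then mapped to one of five severity bands) can be sketched in a few lines of Python. This is a minimal illustration, not the study's code; the function name `depression_severity` and the table `SEVERITY_BANDS` are hypothetical, and the upper bound of 42 simply follows from seven items rated at most 3 and doubled.

```python
# Severity bands on the doubled score:
# (upper bound, severity name, numeric label).
SEVERITY_BANDS = [
    (9, "Normal", 0),
    (13, "Mild", 1),
    (20, "Moderate", 2),
    (27, "Severe", 3),
    (42, "Extremely Severe", 4),  # 7 items * 3 points * 2 = 42 is the maximum
]


def depression_severity(item_scores):
    """Return (score, severity name, label) for seven ratings in 0..3."""
    if len(item_scores) != 7 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("expected seven ratings, each in 0..3")
    score = 2 * sum(item_scores)  # the subscale sum is multiplied by 2
    for upper, name, label in SEVERITY_BANDS:
        if score <= upper:
            return score, name, label


print(depression_severity([1, 2, 1, 2, 1, 0, 1]))  # (16, 'Moderate', 2)
```

Because the doubled score is always even, the gaps between bands (for example between 9 and 10) never leave a score unclassified.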

Exploratory data analysis.
Of the 444 respondents, 291 were male and 153 were female, ranging from below 20 years to over 35 years of age. University students usually enroll after the age of 17 years; thus 'below 20 years' implies the range from 17 to 20 years. The survey was primarily focused on the students at the Islamic University of Technology (IUT) situated in Gazipur, Bangladesh. As a result, most of the participants were aged between 21 and 25 years. In addition, since the university operates under the umbrella of an international organization, the Organization of Islamic Cooperation (OIC), international students from more than 20 countries such as Iran, Pakistan, Brunei, Indonesia, Uganda, Nigeria, Cameroon, Gambia, and Yemen were attending the university. These participants in particular had experienced the transition of their academic activities from remote learning to hybrid mode. Moreover, the participants ranged from first-year undergraduates to postgraduates and came from several types of contexts (e.g., small towns, big cities), which added to the authenticity of the results obtained through the survey. First-year students outnumbered each of the other four academic levels, and significantly more respondents came from big cities than from villages or small towns. The overall data distribution is shown in Tables 1-5.
An elaborated analysis with the findings of the relationship between survey data and depression severity is demonstrated in the next section as stated in the research contribution.

Data pre-processing
Before training the algorithms on the dataset, several pre-processing steps were performed. Microsoft Excel was used to compute descriptive statistics (mean and standard deviation) from the DASS-Depression scale. All the coding simulation was executed on the Jupyter Notebook of the Python Navigator platform. First, the inputs and output were separated, and some irrelevant columns were dropped. The presence of the dropped columns did not affect the accuracy value one way or the other, so they were eliminated to keep the focus on improving accuracy, F1-score, and ROC-AUC. In the .csv file, the 'Result' column was used as the output. There were 41 columns in the preprocessed dataset. Three columns were dropped: the timestamp, voluntary consent, and sum. The remaining 38 columns were used as input. Then, alphabetical data were converted into numerical values using label encoding, which maps each category label to an integer so that it becomes machine-readable without increasing the dataset's dimensions [70]. It is an important preprocessing step for structured datasets in supervised learning. After that, the MinMaxScaler from the scikit-learn library was applied directly to the dataset to normalize the input variables. MinMaxScaler rescales each feature linearly to a given range, which is [0, 1] by default regardless of the sign of the data. In the presence of large outliers, this scaling compresses all the inliers into a narrow part of that range. The default configuration was utilized, and scaled values were obtained in the range [0, 1]. MinMaxScaler is typically utilized when the upper and lower boundaries are well defined from domain knowledge [71]. One striking feature of the MinMaxScaler is that it preserves the shape of the original distribution without tampering with the information included in the original data. It should be noted that the MinMaxScaler does not lessen the significance of outliers. The Min-Max scaler is widely adopted whenever the data have a bounded range of values or the distribution is not Gaussian [72]. As the dataset of this study also did not follow a Gaussian distribution, the Min-Max method was adopted for feature scaling. First, a MinMaxScaler instance was defined with default hyperparameters. Once defined, the fit_transform() function was utilized to generate a transformed version of the dataset.
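The two pre-processing steps described above, label encoding followed by Min-Max scaling, can be illustrated with scikit-learn's `LabelEncoder` and `MinMaxScaler`. The sketch below is only an illustration under stated assumptions: the two columns (`gender`, `sleep_hours`) and their values are hypothetical and are not taken from the study's dataset.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical survey columns (illustrative values, not the study's data).
gender = np.array(["Male", "Female", "Female", "Male"])
sleep_hours = np.array([5.0, 8.0, 6.5, 7.0])

# Label encoding: classes are sorted, so "Female" -> 0 and "Male" -> 1.
gender_encoded = LabelEncoder().fit_transform(gender)

# Stack the columns into a feature matrix and rescale each one to [0, 1].
X = np.column_stack([gender_encoded, sleep_hours]).astype(float)
scaler = MinMaxScaler()  # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)

print(X_scaled)
```

Each column is mapped through (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. When scaling is combined with cross-validation, fitting the scaler on the training folds only is a common way to avoid leaking test-set statistics; whether that was done here is not stated in the text.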

Feature engineering
After the pre-processing of the dataset, the methodology was divided into two parts. In the first part, all 38 features were used to train the eight ML models named previously, without a feature selection or feature extraction step. However, to determine how substantial this automated approach was in terms of performance, several feature engineering methods were added in the second part to observe the change in performance. The adopted feature selection methods were the Chi-square test and RFE, whereas the selected feature extraction methods were PCA and SparsePCA [73,74]. Unlike some other methods such as SelectKBest and Boruta [61,63], these methods had yet to be tested in the literature; hence, this study aimed to investigate the difference in the results among them.
2.4.1 Feature selection methods. Feature selection methods select the most apposite features and disregard the irrelevant ones so that the ML models do not suffer from mediocre performance on high-dimensional data [75]. A reduced feature set is also more interpretable to researchers and reduces the training time. In this study, 30 features out of 61 features were selected using both the Chi-square test and the RFE method, which are discussed below.
Chi-square test. In conventional applications of the Chi-square ($\chi^2$) test, the observations are categorized into mutually exclusive classes. If the null hypothesis, that there are no significant differences between the classes in the population, is true, the test statistic generated from the observations follows a $\chi^2$ frequency distribution. A random sample of $n$ observations from a population is divided into $k$ mutually exclusive classes, each with an observed count $x_i$ (for $i = 1, 2, \ldots, k$). The dependency of two events is tested by using this formula [75]:

$$\chi^2 = \sum_{i=1}^{k} \frac{(x_i - m_i)^2}{m_i}$$

where $x_i$ is the observed value and $m_i = n p_i$ is the expected value, $n$ being the number of samples and $p_i$ the probability of class $i$ under the null hypothesis. And,

$$\sum_{i=1}^{k} p_i = 1, \qquad \sum_{i=1}^{k} m_i = n \sum_{i=1}^{k} p_i = \sum_{i=1}^{k} x_i$$

Here, a high $\chi^2$ value means that the feature has significant relevance to the output, and thus that feature can be used for training the ML models.
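As an illustration of the statistic above, the following sketch computes the Pearson chi-square value between one categorical feature and the class label from their contingency table. It is a minimal example under assumptions: the function name `chi_square_statistic` and the toy arrays are hypothetical, and library implementations used for feature selection may compute the expected counts differently.

```python
import numpy as np


def chi_square_statistic(feature, target):
    """Pearson chi-square statistic between two categorical arrays."""
    feature = np.asarray(feature)
    target = np.asarray(target)
    f_values, f_idx = np.unique(feature, return_inverse=True)
    t_values, t_idx = np.unique(target, return_inverse=True)

    # Observed counts x_i for every (feature value, class) cell.
    observed = np.zeros((len(f_values), len(t_values)))
    np.add.at(observed, (f_idx, t_idx), 1)

    # Expected counts m_i = n * p_i under independence of feature and class.
    n = observed.sum()
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    return float(((observed - expected) ** 2 / expected).sum())


dependent = chi_square_statistic([0, 0, 1, 1, 0, 1, 0, 1], [0, 0, 1, 1, 0, 1, 0, 1])
independent = chi_square_statistic([0, 0, 1, 1, 0, 1, 0, 1], [0, 1, 0, 1, 0, 1, 1, 0])
print(dependent, independent)  # 8.0 0.0
```

A feature that perfectly tracks the label yields a large statistic, while a feature whose values are evenly spread over the classes yields zero, which is the ranking behavior that feature selection relies on.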
Recursive Feature Elimination (RFE). The RFE method falls under the filtering category, which uses a proxy measure to evaluate a feature subset. Instead of a clear best feature subset, many filters offer a feature ranking, with cross-validation used to determine the ranking's cutoff point. RFE is one such method; it keeps only the key features by calculating the weight of each feature for a given output. As a result, model complexity is much reduced, and the model can be run efficiently by training on an optimized set of features. The weighted voting scheme follows this equation [76]:

$$D(\mathbf{x}) = \mathbf{w} \cdot (\mathbf{x} - \boldsymbol{\mu})$$

where $D(\mathbf{x})$ is the decision function, $\mathbf{w}$ is the weight vector, and $\boldsymbol{\mu}$ is the mean vector over all training patterns. The ranking is determined by using the following equation in an iterative process:

$$DJ(i) = \frac{1}{2} \frac{\partial^2 J}{\partial w_i^2} (Dw_i)^2$$

where $J$ is the cost function and $Dw_i$ is the change in weight $w_i$ caused by removing feature $i$. The iterative algorithm is as follows:
1. Train the classifier by optimizing the weights $w_i$ with respect to $J$.
2. Determine the ranking criterion $DJ(i)$ or $(w_i)^2$ for each feature.
3. Discard the feature with the lowest ranking criterion.
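The three iterative steps above can be sketched as a short loop. The example below is an assumption-laden illustration rather than the study's implementation: it swaps in an ordinary least-squares linear model as the classifier, uses $(w_i)^2$ as the ranking criterion, and runs on synthetic data; the function name `rfe_linear` is hypothetical.

```python
import numpy as np


def rfe_linear(X, y, n_keep):
    """Recursive feature elimination with a least-squares linear model.

    y holds labels in {-1, +1}; features are ranked by w_i ** 2.
    """
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Step 1: fit the weights on the features that are still in play.
        w, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # Step 2: ranking criterion (w_i)^2 for each remaining feature.
        criterion = w ** 2
        # Step 3: discard the feature with the lowest criterion.
        del remaining[int(np.argmin(criterion))]
    return remaining


# Synthetic data in which only columns 1 and 3 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(2.0 * X[:, 1] - 1.5 * X[:, 3])
selected = rfe_linear(X, y, n_keep=2)
print(selected)
```

Refitting after every removal is what distinguishes the recursive procedure from a one-shot ranking: the weights of the surviving features are re-estimated once a weak feature is gone.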

Feature extraction methods.
In feature extraction, a new set of features, extracted from the existing features, is used for model training. The outcome is similar to that of feature selection, that is, a reduced and concise set of features capable of predicting the outcome most precisely. These methods also help to save memory and computational power while producing adequate performance. Though feature extraction methods are widely employed for image processing purposes, discrete-valued features can be tested as well to explore their effectiveness [77]. The feature extraction methods applied in this study were PCA and SparsePCA, both of which are discussed below.
Principal Component Analysis (PCA). PCA is a classical statistical method for transforming the attributes of a dataset into a new set of uncorrelated attributes called principal components. PCA can be used to reduce the dimensionality of a dataset while retaining as much of its variability as possible. In PCA, the dataset is represented by an $N \times p$ matrix $X$. Each column $X_j$ contains all the observations for one attribute. PCA transforms the set of input columns $X_1, X_2, \ldots, X_p$ into another set of column vectors $T_1, T_2, \ldots, T_p$. A $p \times p$ matrix $P$ specifies this linear transformation of the matrix $X$ such that the transformed variables $T$ are given by:

$$T_i = \sum_{j=1}^{p} P_{ji} X_j, \quad i = 1, 2, \ldots, p$$

Alternatively, the equation can be rewritten as:

$$T = XP$$

where $P$ is known as the loading matrix, and the columns of this matrix can be calculated as eigenvectors of the matrix $X^T X$ [77].
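The PCA transformation $T = XP$, with the loading matrix $P$ taken from the eigenvectors of $X^T X$, can be sketched with NumPy as follows. This is a minimal illustration on random data; the function name `pca_transform` is hypothetical, and centering the columns first is an assumption that matches common practice.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project X (N x p) onto its leading principal components."""
    Xc = X - X.mean(axis=0)                       # center each attribute
    eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    P = eigvecs[:, order]                         # loading matrix (p x k)
    T = Xc @ P                                    # transformed variables T = XP
    return T, P


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
T, P = pca_transform(X, n_components=2)
print(T.shape, P.shape)  # (100, 2) (4, 2)
```

Because the columns of $P$ are orthonormal eigenvectors, the resulting components are uncorrelated and are ordered by the amount of variance they retain.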
Sparse Principal Component Analysis (SparsePCA). By adding sparsity structures to the input variables, SparsePCA extends the traditional PCA approach for reducing the dimensionality of data. A particular drawback of standard PCA is that its principal components are usually linear combinations of all input variables. SparsePCA gets around this drawback by identifying linear combinations of a smaller number of input variables [78]. The sparse factors that account for the greatest amount of variance can be expressed as follows:

$$\phi(\rho) = \max_{\|z\| \le 1} \; z^T \Sigma z - \rho \, \mathrm{Card}(z)$$

where $z \in \mathbb{R}^n$, $\rho$ is the parameter controlling sparsity, and $\mathrm{Card}(z)$ is the number of non-zero coefficients of $z$. There are several solutions to this problem, one of which is the Greedy Search Algorithm [78]. For this algorithm, the input is $\Sigma \in \mathbb{R}^{n \times n}$. The steps are as follows:
1. Preprocessing: Sort the variables by their diagonal elements in decreasing order, then permute the elements of $\Sigma$ as required. Compute the Cholesky decomposition $\Sigma = A^T A$, where $a_j$ denotes the $j$-th column of $A$.
2. Initialization: $I_1 = \{1\}$, $x_1 = a_1 / \|a_1\|$.
3. Compute $i_k = \arg\max_{i \notin I_k} \lambda_{\max}\left(\sum_{j \in I_k \cup \{i\}} a_j a_j^T\right)$.
4. Set $I_{k+1} = I_k \cup \{i_k\}$ and compute $x_{k+1}$ as the leading eigenvector of $\sum_{j \in I_{k+1}} a_j a_j^T$.
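A greedy search over sparsity patterns can be sketched as below. This is an illustrative reading of the algorithm, not the reference implementation: the function name `greedy_sparsity_patterns` and the toy covariance matrix are hypothetical. The sketch relies on the fact that, when $\Sigma = A^T A$, the leading eigenvalue of $\sum_{j \in I} a_j a_j^T$ equals the leading eigenvalue of the submatrix of $\Sigma$ restricted to $I$, so the Cholesky factor is not formed explicitly.

```python
import numpy as np


def greedy_sparsity_patterns(sigma):
    """Greedily grow the support set that maximizes explained variance."""
    sigma = np.asarray(sigma, dtype=float)
    # Preprocessing: sort variables by decreasing variance (diagonal entries).
    order = np.argsort(-np.diag(sigma), kind="stable")
    s = sigma[np.ix_(order, order)]
    n = s.shape[0]

    support = [0]                      # start from the largest-variance variable
    patterns = [[int(order[0])]]
    while len(support) < n:
        candidates = [i for i in range(n) if i not in support]
        gains = []
        for i in candidates:
            idx = support + [i]
            # Leading eigenvalue of the covariance restricted to the candidate support.
            gains.append(np.linalg.eigvalsh(s[np.ix_(idx, idx)])[-1])
        support.append(candidates[int(np.argmax(gains))])
        patterns.append([int(order[i]) for i in support])
    return patterns


# Toy covariance: variable 0 is correlated with variable 2 but not with 1.
sigma = [[5.0, 0.0, 3.0],
         [0.0, 4.0, 0.0],
         [3.0, 0.0, 4.0]]
patterns = greedy_sparsity_patterns(sigma)
print(patterns)  # [[0], [0, 2], [0, 2, 1]]
```

Each returned pattern lists the original variable indices of a support set of increasing size; in the toy example the correlated variable 2 is added before the uncorrelated variable 1 because it raises the leading eigenvalue more.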

Classification models
Before building the ML models, 5-fold cross-validation was set up for splitting the data into training and test sets. The dataset was divided into 5 distinct segments (folds). In the first round, 80% of the dataset was used for training and the remaining 20% for testing. In the next round, the 20% used for testing in the first round was included in the 80% training set, and a different 20%, which had not been part of the first test set, was used for evaluation. This process was repeated 5 times in total so that no specific subset was always used for training and every subset was used for testing exactly once. As a result, the possibility of overfitting was minimized, and the mean value over the 5 folds was taken as the final score for each of the 3 metrics. Thus, the previously mentioned 8 ML models were trained using the training set, and the models' performance was evaluated using the test set. A brief description of each of the models is given below.
2.5.1 Gaussian Naïve Bayes. It is a probabilistic classifier based on Bayes' theorem. The fundamental presumption is that the features from which an outcome is predicted are independent of one another [79]. The following is the classifier formula:

$$P(y \mid x_1, \ldots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \ldots, x_n)}$$

where $y$ is the class and $x_1, \ldots, x_n$ are the feature values. The probability estimate will be zero if a feature value never appears with a category in the training data [54], because the probability estimate is related to the number of times a feature's value occurs.
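The 5-fold procedure and the Gaussian Naïve Bayes classifier described above can be combined in a compact NumPy sketch. It is a simplified illustration under assumptions: the helper names (`fit_gnb`, `predict_gnb`, `cross_val_accuracy`) are hypothetical, the data are synthetic, the folds are not stratified, and a small constant is added to the variances for numerical stability.

```python
import numpy as np


def fit_gnb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances


def predict_gnb(model, X):
    """Pick the class with the highest log posterior for each row of X."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]
    # Per-feature Gaussian log-likelihoods; summing them uses the independence assumption.
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    scores = np.log(priors)[None] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]


def cross_val_accuracy(X, y, k=5, seed=0):
    """Mean accuracy over k folds; each fold serves as the test set exactly once."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit_gnb(X[train], y[train])
        scores.append((predict_gnb(model, X[test]) == y[test]).mean())
    return float(np.mean(scores))


# Synthetic two-class data with well-separated means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 3)), rng.normal(4.0, 1.0, size=(50, 3))])
y = np.array([0] * 50 + [1] * 50)
mean_acc = cross_val_accuracy(X, y)
print(mean_acc)
```

With k = 5, every round trains on 80% of the rows and tests on the remaining 20%, and the reported score is the mean over the five rounds.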

Logistic Regression.
The logistic (sigmoid) function is used as the hypothesis function in this algorithm, which is based on the idea of probability. By its nature, this function is limited to a range between 0 and 1, and its formula is as follows:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is the weighted sum of the input features. To distinguish one class from another in a classification task, a threshold probability value is specified [80]. The same general idea applies to multiclass logistic regression, with the exception that there are M possible outcomes rather than simply two.

Random Forest.
The random forest extends the decision tree concept by incorporating many individual trees. A weight is assigned to each tree's result, and the sum of those weights serves as the foundation for determining the predicted class through voting. The bagging formula is as follows:

$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x')$$

where $f_b$ is the $b$-th classification tree, $B$ is the number of bagged trees ($b = 1, 2, \ldots, B$), and $x'$ is the sample to be classified. Because it is an ensemble classifier that draws on multiple decision trees, this algorithm often offers higher accuracy than other classification techniques [81].

Support Vector Machine (SVM).
This technique locates a hyperplane that separates a collection of points in an N-dimensional feature space. To make the classification robust for future data, the hyperplane must maximize the margin between the classes of data. Support vectors are the points closest to the hyperplane; they determine the margin's position and orientation. The equation for linear SVM, among other types, is as follows:

$$w^T x + b = 0$$

Here, $w$ is the normal vector to the hyperplane, and $b$ determines the offset distance from the origin. SVM has the advantage of being less prone to over-fitting than other algorithms [82].
2.5.5 Light Gradient Boosted Machine. Shortened as LightGBM, this algorithm is built on decision trees and can reproduce results like another boosting algorithm (XGBoost). Regarding efficiency and memory usage, LightGBM employs a highly efficient histogram-based decision tree learning technique. As in other gradient boosting ensembles, the models are fitted using any arbitrary differentiable loss function and the gradient descent optimization procedure. LightGBM also uses the Gradient-based One-Side Sampling (GOSS) technique, whose steps are described in [83].
2.5.6 Multi-layer Perceptron. Multi-layer Perceptron or MLP is a supervised learning method trained with backpropagation. Specifically, this feedforward network requires at least three layers of nodes, namely the input layer, hidden layer, and output layer. One of the activation functions, the hyperbolic tangent, is as follows:

$$y(v_i) = \tanh(v_i)$$

Here, $y$ is the node output and $v_i$ is the weighted sum of the inputs. In the backpropagation process, the node weights are modified in accordance with the change that reduces the overall output error, which follows this equation [84]:

$$E(n) = \frac{1}{2} \sum_{j} e_j^2(n)$$

where $e_j(n)$ is the degree of error in output node $j$ for the $n$-th data point.
2.5.7 CatBoost. Another boosting framework used in this study is CatBoost, which is one of the most frequently used ML models in the modality. In contrast to LightGBM, which risks losing information when clustering tail categories according to expected target value (target statistics), CatBoost uses the target statistics as new numerical features. These features follow an ordering principle in which a random permutation of the training samples is introduced and only the preceding samples are used to calculate target statistics [85]. The sample sets used for a training and a testing sample are as follows:

$$D_k = \{x_j : \sigma(j) < \sigma(k)\} \;\; \text{(training sample)}, \qquad D_k = D \;\; \text{(testing sample)}$$

where $\sigma$ signifies a random permutation of the samples, $D$ is the full training set, and $D_k$ is the set of samples used to compute the target statistics of sample $x_k$.

K-Nearest Neighbor (KNN).
The KNN algorithm operates by finding the data points that are close to one another. Euclidean distance is typically used to measure how close a new point is to its k nearest reference points. The value of k determines how stable and precise the algorithm's predictions are [86]. A point $x$ in the feature space is assigned to the class $Y_1$ that is most common among its k nearest neighbors. KNN is more straightforward to construct than other models because no additional assumptions or adjustments are required.
The parameters used for each of the classifiers can be seen in Table 6. Other than the LightGBM and CatBoost classifiers, which have their own libraries, scikit-learn was the library used to call the parameters and set their values. All the feature engineering methods followed these values.

Hyperparameter optimization
To achieve better results, RandomizedSearchCV was employed to optimize the hyperparameters of all eight classifiers. In random search, random combinations of hyperparameter values are used to train the model, which works well for discovering combinations that are hard to predict [87]. Unlike grid search, which frequently takes more time to complete as it tests every combination, random search tends to be efficient most of the time. The hyperparameter optimization problem can be addressed in the following way:

$$\phi^{*} \approx \underset{\phi \in \Phi}{\arg\min} \; \psi(\phi)$$

where $\phi$ is a hyperparameter configuration, $\Phi = \{\phi_1, \ldots, \phi_S\}$ is the set of trial values, and $\psi$ is the hyperparameter response function. A flowchart for finding the optimal combination of hyperparameters in random search is given in Fig 1. The optimal combination was found experimentally for the participants of this study, so it can be different for other study participants.
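The idea of random search, drawing S trial configurations and keeping the one with the best response, can be sketched in plain Python. Everything specific in the example is a hypothetical stand-in: the search space, the `toy_error` response function (which replaces a cross-validated error), and the function name `random_search`.

```python
import random


def random_search(response, space, n_trials, seed=0):
    """Evaluate n_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_value = None, float("inf")
    for _ in range(n_trials):
        # Draw one value per hyperparameter, independently of the others.
        config = {name: rng.choice(values) for name, values in space.items()}
        value = response(config)
        if value < best_value:
            best_config, best_value = config, value
    return best_config, best_value


# Hypothetical search space and response function (stand-in for a
# cross-validated validation error that should be minimized).
space = {"depth": [2, 4, 6, 8], "learning_rate": [0.01, 0.03, 0.1, 0.3]}


def toy_error(config):
    return (config["depth"] - 6) ** 2 + (config["learning_rate"] - 0.1) ** 2


best, err = random_search(toy_error, space, n_trials=30)
print(best, err)
```

Unlike grid search, the number of evaluations is fixed by `n_trials` rather than by the size of the grid, which is why the cost stays bounded as more hyperparameters are added.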

Performance analysis
The metrics that were considered to evaluate the ML classifiers were accuracy, F1-score, and ROC-AUC.These metrics illustrate the performance of the ML models holistically.The confusion matrix is also another way to show the results in terms of distinct numbers.
2.7.1 Accuracy. This is the most widely used metric for measuring the correctness of an ML model, that is, the proportion of correctly predicted observations among all observations. Although accuracy is one of the most used performance metrics in the field of ML, two more metrics, namely F1 and ROC-AUC scores, are also adopted by researchers to evaluate the outcome of an algorithm [88]. The formula for accuracy is:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Here, TP and TN denote the numbers of positive and negative instances that are correctly classified in the multi-class classification of this study, and FP and FN denote the numbers of instances incorrectly classified as positive and negative, respectively. In this type of classification, a threshold is set for each class, and any prediction that falls outside the threshold counts as a misclassification.
2.7.2 F1-score. This metric is the harmonic mean of precision and recall and is used to balance the two [89]. The value of the F1-score ranges from 0 to 1.0 and can be calculated using Eq (18):

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{18}$$

Here,

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
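For a multi-class problem, accuracy and the F1-score can both be derived from the confusion matrix. The sketch below is illustrative only: the function names and the toy labels are hypothetical, and macro averaging (an unweighted mean of per-class F1 values) is an assumption, since the averaging scheme is not stated in the text.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, n_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


def accuracy_and_macro_f1(cm):
    """Return overall accuracy and the unweighted mean of per-class F1."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp   # predicted as the class but belonging elsewhere
    fn = cm.sum(axis=1) - tp   # belonging to the class but predicted elsewhere
    precision = np.divide(tp, tp + fp, out=np.zeros_like(tp), where=(tp + fp) > 0)
    recall = np.divide(tp, tp + fn, out=np.zeros_like(tp), where=(tp + fn) > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return float(tp.sum() / cm.sum()), float(f1.mean())


# Toy labels for three classes (illustrative values only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
acc, macro_f1 = accuracy_and_macro_f1(cm)
print(cm)
print(round(acc, 4), round(macro_f1, 4))  # 0.6667 0.6556
```

In the toy example the per-class F1 values are 0.5, 0.8, and about 0.667, so the macro average differs from the accuracy even though both summarize the same confusion matrix.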

ROC-AUC.
ROC-AUC is the area under the Receiver Operating Characteristic (ROC) curve, which is plotted using the True Positive Rate (TPR) and False Positive Rate (FPR) obtained by Eqs (19-20):

$$\text{TPR} = \frac{TP}{TP + FN} \tag{19}$$

$$\text{FPR} = \frac{FP}{FP + TN} \tag{20}$$

As opposed to a threshold-based metric like accuracy, the AUC value indicates a classifier's performance by using ranking, and it is sometimes a better metric for evaluation [90].
For the multi-class classification task, the performance is measured using this formula [91]:

$$\text{AUC}_{\text{total}} = \frac{2}{c(c-1)} \sum_{i<j} \hat{A}(i, j)$$

where $\hat{A}$ is the probability operator, $i$ and $j$ are a pair of classes, and $c$ is the number of classes. In general, the one-vs-all approach can be utilized to plot N AUC curves for N classes in a multi-class classification problem [92].
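A minimal sketch of a pairwise multi-class AUC is given below, assuming the measure averages a rank-based AUC over every pair of classes, with each pair evaluated in both directions. The function names and the toy labels and probabilities are hypothetical and not taken from the study. Class labels are assumed to be the integers 0 to c-1, each appearing at least once.

```python
from itertools import combinations


def auc_one_direction(scores_pos, scores_neg):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def pairwise_multiclass_auc(y_true, proba):
    """Average the pairwise AUC over all unordered class pairs.

    y_true: list of integer class labels 0..c-1.
    proba:  proba[s][k] is the predicted probability of class k for sample s.
    """
    classes = sorted(set(y_true))
    c = len(classes)
    total = 0.0
    for i, j in combinations(classes, 2):
        # Rank class-i samples against class-j samples using P(class i).
        a_i_j = auc_one_direction(
            [p[i] for y, p in zip(y_true, proba) if y == i],
            [p[i] for y, p in zip(y_true, proba) if y == j],
        )
        # Same comparison from class j's point of view, using P(class j).
        a_j_i = auc_one_direction(
            [p[j] for y, p in zip(y_true, proba) if y == j],
            [p[j] for y, p in zip(y_true, proba) if y == i],
        )
        total += (a_i_j + a_j_i) / 2
    return 2 * total / (c * (c - 1))


# Toy labels and predicted probabilities, for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
proba = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
]

print(pairwise_multiclass_auc(y_true, proba))
```

Because the measure depends only on how samples are ranked, it is unaffected by the choice of a decision threshold, which is the contrast with accuracy drawn in the text.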
Lastly, an overall workflow diagram is shown in Fig 2, and all the findings are presented and carefully analyzed in the following section.

An in-depth analysis was performed to better understand the connection between depression severity and the different study groups. Here, all the illustrations are boxplots produced with the Python matplotlib library and seaborn package. A boxplot is a graphical method of showing the data distribution in terms of range, locality, and skewness by dividing the data into four quartiles [93]. In the graph, the box depicts the data from the first quartile to the third quartile, with whiskers (the lines protruding from the sides of each box) pointing out the minimum and maximum values, if applicable.

Gender, age-groups, and ethnicity-based depression.
First, there were about twice as many data points for male participants as for female participants, as shown in Table 7. In terms of depression levels, more than half of the data points from females fell under the 'Moderate' to 'Extremely Severe' categories, a higher level than that of the males, as shown by the median line in Fig 3. Overall, no depression was observed among about 44% of the participants, while mild-to-moderate and severe-to-extremely-severe depression were observed among 25% and 31% of the population, respectively.

For the age groups, it can be seen from Table 8 that there was a rise in the 'Extremely Severe' level for participants between 21 and 25 years. The median line shown in Fig 4 also demonstrates the relatively high level of depression in this group compared to the two other groups, namely participants below 20 years and those between 26 and 30 years.

Lastly, though a large number of participants were inhabitants of Bangladesh, there were international participants whose depression severity can be observed in Tables 9 and 10. South Asian participants from countries other than Bangladesh, Pakistan, and Afghanistan were in a state of moderate depression, compared to the mild depression of these countries' participants. Notably, most participants from the Southeast Asian countries fell under the 'Moderate' to 'Extremely Severe' categories and, as a result, the boxplot (Fig 5) is shifted up to those levels. The least depression can be seen in the participants from Africa, more than two-thirds of whom were facing no depression.
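The group comparisons in this section are read off boxplots, chiefly from the median line. The sketch below shows, under stated assumptions, how the quantities drawn by a boxplot can be computed for ordinal severity codes. The coding of the five levels as integers 0 to 4 and both sample groups are made up for illustration, and the whiskers are taken as the plain minimum and maximum.

```python
from statistics import quantiles

# Assumed ordinal coding of the five severity levels.
SEVERITY = ["Normal", "Mild", "Moderate", "Severe", "Extremely Severe"]


def box_summary(codes):
    """Five-number summary that a boxplot draws for a list of severity codes."""
    q1, median, q3 = quantiles(codes, n=4, method="inclusive")
    return {"min": min(codes), "q1": q1, "median": median, "q3": q3, "max": max(codes)}


# Made-up severity codes for two groups (not data from the study).
group_a = [0, 0, 0, 1, 1, 2, 2, 3, 4]
group_b = [0, 1, 2, 2, 2, 3, 3, 4, 4]

for name, codes in (("group_a", group_a), ("group_b", group_b)):
    summary = box_summary(codes)
    print(name, summary, "median level:", SEVERITY[int(summary["median"])])
```

A group whose median sits at 'Moderate' has at least half of its members at that level or above, which is how the median lines in the figures are interpreted in the text.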

Academic level-based depression.
The prevalence of depression among students of different academic levels, ranging from first-year students to postgraduates, was identified; the results are presented in Figs 6, 7 and Table 11. It was found that a randomly selected student in his/her 2nd or 3rd academic year has a larger probability of facing a higher severity of depression than a 1st year or postgraduate student, as shown in Fig 6. Postgraduate students experienced the least depression during the hybrid mode of learning (median line at 'Normal'), immediately followed by fourth-year undergraduate students (median line at 'Mild'). This can be counted as a valid comparison, as the percentages of students in the four academic levels excluding the first year were relatively close to each other (Table 11). Meanwhile, one finding also suggests that students who were satisfied with their examination results and other academic activities fell under the 'Normal' category more often than students who were dissatisfied. In fact, half of the students who were not enjoying the academic curriculum were at the 'Moderate' depression level, in contrast to half of the contented students, who experienced no depression at all (Fig 7). This is a crucial finding, since this study was intended to investigate the applicability of hybrid learning, which should be able to cater for a broad student demographic.

Residence and physical activity-based depression.
Depression was found to be more prevalent in big cities than in villages or small towns, which can be inferred from Table 12. Big city dwellers lived with a 'Moderate' median level of depression compared to the 'Normal' median level of small town or village inhabitants, as shown in Fig 8. This finding is empirically consistent with internet connectivity, which is more ubiquitous in cities than in villages. Moreover, the respondents who were continuously connected to the internet were found to be slightly more depressed (Fig 9). The individuals who were suffering from lack of sleep were more depressed than those who were satisfied with the quality of their sleep. In fact, a gradual decline in depression can be seen among the respondents as their sleep satisfaction increased, as shown in Fig 10. Finally, those whose sleep duration was between 7 and 9 hours were found to be experiencing less depression (median) than those who were oversleeping (more than 9 hours) or undersleeping (less than 7 hours), as shown in Fig 11.

Statistical summary.
The overall summary of the statistical data analysis is presented in Fig 12. The correlation between depression severity and the input variables is shown in terms of attribute importance obtained by performing the Chi-squared test. The letters in parentheses designate the columns in the dataset used in this study. Gender played a key role as a demographic variable in calculating depression severity scores. Age and residence are also seen to affect the severity score. However, the questionnaire items were found to be more crucial in determining a person's depression severity.
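To make the idea of Chi-squared attribute importance concrete, the sketch below computes the Pearson chi-squared statistic for a contingency table of one categorical attribute against depression status. A larger statistic indicates a stronger association with the target. The counts are invented for illustration, and the exact procedure used to produce Fig 12 may differ from this textbook form.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for an r x c contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Count expected in this cell if attribute and target were independent.
            expected = row_totals[r] * col_totals[c] / n
            stat += (observed - expected) ** 2 / expected
    return stat


# Made-up counts: rows are two categories of an attribute,
# columns are (not depressed, depressed).
associated = [[30, 10], [10, 30]]
independent = [[20, 20], [20, 20]]

print(chi_squared(associated))   # large value: attribute is informative
print(chi_squared(independent))  # zero: attribute carries no information
```

Ranking attributes by this statistic and keeping the top-scoring ones is the usual way a Chi-squared test is turned into a feature selection step.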

Automated detection performance of the ML models
For the first test, all features were used to train the models. Based on the findings shown in Table 13, SVM achieved the highest accuracy of 85.74% and F1-score of 0.8492. However, the CatBoost ROC-AUC score (0.9846) before hyperparameter tuning was the highest among the models' AUC scores. The worst performing classifier was KNN, with the lowest score on all three metrics. Other than that, the accuracy values are largely close to each other, as are the ROC-AUC scores, which can be seen in Figs 13 and 14, respectively.

ML models with feature engineering results
To achieve better performance from the ML models and improve the automated detection process, different feature engineering methods were incorporated, as discussed in the methodology. The comparative outcome of the different feature engineering algorithms is presented next.

Comparison between Chi-square test and RFE.
From the feature selection results in Tables 14 and 15, SVM achieved the highest score for both methods most of the time in terms of accuracy, F1-score, and AUC. Notably, before tuning, SVM performed inadequately and came last among the classifiers, which was reversed after tuning. The only exception in which this classifier was outperformed was in the Chi-square test, where the LR model achieved the highest accuracy of 88.74%. That said, KNN once again exhibited the lowest scores on all three metrics for both methods. Overall, the RFE method assisted the ML models comparatively better than the Chi-square test in this study's findings, as shown in Figs 15 and 16.

Comparison between PCA and SparsePCA.
When the ML models were trained with the PCA feature extraction method, the LR model consistently performed better than other models, achieving 90.55%, 90.35%, and 0.9943 for accuracy, F1-score, and ROC-AUC score, respectively (Tables 16 and 17). Nevertheless, with the SparsePCA method, a different model, CatBoost, outperformed the rest and became the best performing model among all, followed by LGBM with a marginal gap. Lastly, KNN was again the weakest classifier for both feature extraction methods. Hence, it can be concluded that SparsePCA showed better performance than the PCA method; the performance comparison is also depicted in Figs 17 and 18.
In addition to the chart comparison, the mean values of the confusion matrices generated from all the tuned ML models with the SparsePCA feature extraction method are shown in Tables 18-20.
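Averaging confusion matrices, as mentioned above, amounts to an element-wise mean over matrices of the same shape. The short sketch below illustrates this with two made-up 2x2 matrices; the study's matrices have five classes and come from several models, and the function name is hypothetical.

```python
def mean_confusion_matrix(matrices):
    """Element-wise mean of equally sized square confusion matrices."""
    n = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[r][c] for m in matrices) / n for c in range(size)]
        for r in range(size)
    ]


# Made-up confusion matrices from two hypothetical models.
cms = [
    [[10, 2], [1, 12]],
    [[12, 0], [3, 10]],
]

print(mean_confusion_matrix(cms))
```

The averaged matrix can contain fractional counts, so it is best read as a summary of typical behaviour across models rather than the output of any single classifier.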

State of depression among the study groups
In this study, a machine learning approach was employed to detect students' depression due to their continual connectivity with mobile devices and the Internet. A multi-class classification was performed on five different severity levels using the DASS-Depression scale. It was identified that male students were comparatively less affected than their female counterparts, a finding consistent with the results of previous studies [63,94]. The younger students were substantially more vulnerable than other students to a higher degree of depression, classified as severe. In terms of ethnicity, it is hard to conclude which student group was more prone to experiencing issues associated with depression, due to the different sample sizes of each ethnic group.
Previous studies revealed that students were more prone to being diagnosed with depressive symptoms at moderate to extremely severe levels compared to other cohorts of people [95]. This situation became worse after the appearance of the COVID-19 epidemic, when mild to extremely severe depression was observed among 46% of students [96]. Specifically, among university students, the percentage rose to 72% [97]. Students who considered themselves inferior performers in online education were more likely to be depressed than others [98]. In this study, considering the impact of continual mobile connectivity, less than half of the students (43.4%) were found to be at the mild-to-extremely-severe level of depression. This can serve as positive feedback for the hybrid mode of learning, as previous studies reported higher percentages [97]. However, this might also be partially explained by another study which revealed that about 32% of students were facing depression symptoms after the campus reopened [99]. This study also shows that satisfaction with academic activities was highly correlated with the students' depression severity level.
In addition, sleep satisfaction and internet connectivity were found to be crucial factors for depression prevalence among the students [98,100]. In this study, participants with low sleep satisfaction were found to be in higher depression severity classes, and consistent internet connectivity was found to have a further influence on the depression severity level.
It is worth mentioning that a significant portion of the students observed in this study had no or very minimal depression. For example, 44% of the data belonged to Group 0, whose students were normal in terms of their depression level. However, more than half of the students may have experienced mild to extremely severe depression at some point in their lives: 19% of students had extremely severe (group 4), 12% had severe (group 3), 16% had moderate (group 2), and 9% had mild (group 1) depression levels. The data were also not balanced in the current study, although balancing is standard practice. This study was conducted with real-life applications in mind, where a balanced distribution of students across depression levels is rarely observed [101]. As a result, the number of students having no (or mild) stress might be larger than the number having high stress, or vice versa. For instance, 44% of the participants had no depression issues, which was the largest share among all the categories under consideration. A similar pattern of data variability is observed in the literature, which holds the same assumption, i.e., that most people around the world have no to mild depression issues [102][103][104]. In this study, this variability across depression levels was not observed to have any significant impact on the accuracy of the ML algorithms.
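The class shares quoted above can be combined into the broader ranges used elsewhere in the paper. The short calculation below does only that arithmetic; the variable and function names are illustrative, and the imbalance ratio is an added convenience measure, not one reported by the study.

```python
# Percentage of participants per severity level, as quoted in the text.
shares = {"Normal": 44, "Mild": 9, "Moderate": 16, "Severe": 12, "Extremely Severe": 19}
ORDER = list(shares)  # severity levels in increasing order


def combined_share(first, last):
    """Sum the shares of all severity levels from `first` to `last`, inclusive."""
    start, stop = ORDER.index(first), ORDER.index(last)
    return sum(shares[level] for level in ORDER[start:stop + 1])


mild_to_moderate = combined_share("Mild", "Moderate")              # 9 + 16
severe_to_extreme = combined_share("Severe", "Extremely Severe")   # 12 + 19
moderate_to_extreme = combined_share("Moderate", "Extremely Severe")
any_depression = combined_share("Mild", "Extremely Severe")        # everything except Normal

# Ratio of the largest class to the smallest, a simple view of class imbalance.
imbalance_ratio = max(shares.values()) / min(shares.values())

print(mild_to_moderate, severe_to_extreme, moderate_to_extreme, any_depression)
print(round(imbalance_ratio, 2))
```

The largest class ('Normal') is nearly five times the size of the smallest ('Mild'), which is the imbalance the paragraph refers to.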

Best performing ML pipeline
When analyzing the performance of the algorithms, it can be said that the metrics showed quite high values for most of the algorithms. Therefore, even if the number of attributes gets higher or the feature set becomes more non-linear in nature, the prediction of depressive disorder in multiple classes (mild, moderate, severe, extremely severe) can be achieved with relatively high accuracy. Furthermore, tuning the hyperparameters enhanced the results in most cases. One notable observation is SVM's popularity and use in clinical applications [105]. Most research work has also employed SVM, which can lead to the implementation and performance analysis of newer algorithms being overlooked.
The research studies from Bangladesh, which are laid out chronologically in Table 21, mostly performed a binary classification identifying whether a person is depressed or not. In this study, the multi-class classification performance is comparable to the binary classification performance of other studies [106,107]. The performance of the KNN algorithm was lower than that of other algorithms in several studies as well as in this study, which suggests that KNN is not able to fit the data properly in depression studies [108,109]. Other than that, traditional classifiers have been satisfactory in terms of exhibiting higher performance.

Effect of feature engineering on ML models
Few studies from Bangladesh have paid attention to the feature engineering step, which benefits the automated detection pipeline. The current study and the one by Zulfiker et al. [63] both found the boosting algorithms' performance to be higher than that of traditional ML classifiers when this step was incorporated. The accuracy rose by about 3% with the Chi-square test feature selection method and by about 5% with the RFE method. Meanwhile, PCA saw a 5% increase in accuracy, and for SparsePCA a significant jump of 15% over the standard performance of the ML algorithms could be observed. This finding needs to be further investigated in future studies so that more clinical applications benefit from feature engineering methods. However, there is little variation in the selection of feature engineering methods, which also needs to be broadened in future studies. Table 21 summarizes the comparison of the current study with other research studies on automated depression detection in Bangladesh.

Implications of the study
This study is intended to make a three-fold research contribution. First, student digital wellbeing and the level of depression stemming from mobile connectivity are illustrated based on different variables (e.g., age, gender, geographic location, academic year of study, sleeping habits, etc.) that provide new insights into how these factors are related to students' depression.
Second, an automated model of the depression detection and classification process is explored by employing eight ML algorithms, namely Gaussian Naïve Bayes, Logistic Regression, Random Forest, Support Vector Machine, Light Gradient Boosted Machine, Multi-layer Perceptron, CatBoost, and K-Nearest Neighbor. The automated detection is examined by incorporating feature selection (Chi-square test and Recursive Feature Elimination) and feature extraction (Principal Component Analysis and SparsePCA) methods to find the best performing ML pipeline. This provides direction for researchers to consider the most suitable model for the detection of student depression in future studies.
Finally, this study is intended to serve as a valuable contribution to the body of depression literature and to policymakers, teachers, educators, and medical practitioners globally in different disciplines and domains. The global pandemic has not only disrupted the physical delivery of education, but it has also affected students' mental health. In this regard, this study provides new insights into how students' mental health is affected by the overuse of mobile devices and internet connectivity.

Limitations of the study
There are a few limitations to this study. Since the data come primarily from Bangladesh along with other OIC countries, the inhabitants of these regions are the most represented in the results. The respondents consisted only of students. The sample could be more diversified so that other classes and professions of society could be studied. Self-reported survey data are prone to misinterpretation and possible biases. Also, this study is investigative in nature, and no experimental design was employed to compare the study findings with a set of clinical samples. The sample size is moderate for this domain, since the difficulty of acquiring mental health-related data is quite high. In particular, higher age groups and overseas respondents contributed a small portion of the survey data, which limits its diversity and means it may not be truly representative of all the participants. On the technical side, the class imbalance issue as well as the over-performance of some classifiers were also evident when the ML models were trained with the SparsePCA method. Though the results were mostly representative of all five classes, there was some imbalance in the dataset. This can affect the outcome of the study and make the results more applicable to those classes that have a higher percentage of students. Also, data balancing techniques were not employed in this study. We intend to make a comparative analysis between the balanced and imbalanced versions of this dataset in future work.
Methodologically, this is an investigative study, and no interpretability methods like SHAP or LIME were employed. Interpretability methods may yield different findings and interpretations that could shift the focus of the study in a different direction. Lastly, the study did not utilize any modality other than textual responses, such as video or images, and thus the findings may be less comprehensive in nature.

Conclusion
Mental health is becoming of increasing concern worldwide. In a South Asian country such as Bangladesh, the importance of mental health-related disorders is even greater, since this country is plagued with many different confounding issues. As a result, the factors that play a crucial role in instigating depression cannot be circumscribed with statistical analysis alone. Furthermore, online or hybrid modes of learning have a higher probability of becoming more accepted in the future and will thus further accelerate the adoption of continual mobile connectivity among educators and students engaged in teaching and learning. Thus, in anticipation of a substantial increase in mental health issues in response to the new normal mode of delivery, it seems timely for a smart healthcare system to be alert to the need to provide greater support to depressed individuals.
In this research, the potential vulnerability of student groups to depression was assessed through an online survey, employing different ML models with multiple feature engineering methods to find the best automated depression detection pipeline. Around 44% of the respondents were rated "normal", whereas a total of 47% of respondents were rated at the moderate-to-extremely severe depression level. In addition, the data analysis revealed some insights about depression among the students based on different age groups and ethnicity.
The performance analysis of the ML models reveals that when feature selection methods like the Chi-square test and RFE are added, accuracy increases by 3% to 5%. In a similar vein, the PCA and SparsePCA methods increased accuracy by 5% to 15%. The SparsePCA method of feature extraction combined with the CatBoost classifier performed the best. Moreover, a comparison with other research works has been presented to estimate the overall condition and the factors that are responsible for causing depression. In future, a larger survey may be conducted across varying cultural and societal settings to reveal a broader picture of students' levels of depression. In addition, neural networks need to be examined to achieve a better understanding of the automated psychiatric consultancy process.

Table 9. Result of the data analysis among ethnicity groups (South Asia and Southeast Asia). https://doi.org/10.1371/journal.pone.0294803.t009

Overall, the RFE method assisted the ML models comparatively better than the Chi-square test in this study's findings, as shown in Figs 15 and 16.

Comparison between PCA and SparsePCA.
When the ML models were trained with the PCA feature extraction method, the LR model consistently performed better than other models, achieving 90.55%, 90.35%, and 0.9943 for accuracy, F1-score, and ROC-AUC score, respectively.